ABSTRACT


In  this  paper  we  describe  the  paracomputer  model  of 
parallel  computation  and  present  paracomputer  algorithms  for 
solving  several  important  problems  including  sorting.  A 
paracomputer  is  essentially  a  traditional  shared  memory 
machine augmented with an extra primitive -- the
"replace-add". We show that this primitive enhances the
model  by  exhibiting  algorithms  that  are  faster  than  any 
possible  algorithm  for  solving  the  same  problem  on  a  shared 
memory  machine.  In  addition,  several  of  the  algorithms  are 
faster  than  any  known  algorithm  for  solving  the  same  problem 
on  a  shared  memory  machine. 
I. INTRODUCTION

Gottlieb  and  Kruskal  [80,81]  present  supersaturated 
ultracomputer  algorithms  for  solving  a  wide  class  of 
problems,  and  suggest  that  such  algorithms  should  be 
investigated  for  other  parallel  architectures.  In  this 
paper  we  present  and  analyze  supersaturated  "paracomputer" 
algorithms  for  solving  many  of  these  same  problems. 


The  paper  is  organized  as  follows:  Section  II 
describes  the  paracomputer  model  of  computation.  Section 
III  contains  the  notation  necessary  for  our  analyses. 
Section  IV  presents  and  analyzes  several  supersaturated 
algorithms  (with  emphasis  on  nonnumerical  problems). 
Section  V  summarizes  our  results. 
II. THE PARACOMPUTER MODEL OF COMPUTATION


An ideal parallel processor, dubbed a "paracomputer" by
Schwartz [80], consists of identical PEs (processing
elements) sharing a common memory, which they may
simultaneously access via three operations: read, write,
and replace-add. The replace-add operation, which requires
two parameters L and E, increments the value at the address
L by the integer E and also returns this sum to the
executing PE. Each PE contains local registers and can
perform some standard set of reasonable operations on them
(e.g. arithmetic operations, Boolean operations,
and  comparisons).  At  every  time  step  each  PE,  independently 
of  all  the  others,  performs  some  one  of  these  local 
operations  or  else  reads,  writes,  or  replace-adds  some 
single  memory  location.  Simultaneous  operations  on  a  common 
memory  location  are  effected  (simultaneously)  in  some  serial 
order. 


For example, assume that during some time step PEi
executes

    replace-add(L, Ei),

PEj executes

    replace-add(L, Ej),

and no other operations are performed on L. Furthermore,
let V be the value in L prior to the time step. Then, at
the end of the time step, L will contain V + Ei + Ej and,
depending on the serial order effected, either PEi and PEj
will receive the values

    V + Ei   and   V + Ei + Ej

respectively, or they will receive the values

    V + Ei + Ej   and   V + Ej

respectively.
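The two outcomes above can be reproduced with a small sequential simulation. This is only an illustrative sketch: the function names and the dictionary standing in for shared memory are assumptions, and the simulation enumerates serial orders rather than executing anything in parallel.

```python
from itertools import permutations

def replace_add(memory, loc, inc):
    """Add inc to memory[loc] and return the new sum (the replace-add primitive)."""
    memory[loc] += inc
    return memory[loc]

def simultaneous_replace_adds(initial, increments):
    """Return every outcome of simultaneous replace-adds on one location.

    Each outcome is (final value, values returned to each PE), one per
    serial order in which the requests could be effected.
    """
    outcomes = set()
    for order in permutations(range(len(increments))):
        memory = {"L": initial}
        returned = [None] * len(increments)
        for pe in order:
            returned[pe] = replace_add(memory, "L", increments[pe])
        outcomes.add((memory["L"], tuple(returned)))
    return outcomes

# V = 10, Ei = 3, Ej = 5
outcomes = simultaneous_replace_adds(10, [3, 5])
```

With V = 10, Ei = 3 and Ej = 5, the set contains exactly the two cases described in the text: the PEs receive (13, 18) or (18, 15), and L ends at 18 either way.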


Although  the  paracomputer  model  of  parallel  processing 
is physically unrealizable, the "ultracomputer group" at the
Courant Institute of New York University is designing a
parallel processor that approximates such a machine. A
crucial aspect of their design is that multiple accesses to
the same location (including replace-adds) are accomplished
in approximately the same time as a single access to a
location. In Gottlieb et al. [80,81], the replace-add
operation is shown to be extremely effective in implementing
many important operating system primitives. (For
discussions of the actual architecture see also Gottlieb and
Schwartz [81].)
III. DEFINITIONS AND NOTATION


The size of a parallel processor is the number of
processing elements (PEs) that it contains. We denote this
processor size by P and number the processing elements
PE1, ..., PEP. The size of (an instance of) a problem is the
number of data items to be processed and is denoted by N.

The time complexity of a sequential algorithm (i.e. an
algorithm for a sequential processor) is a function of the
problem  size;  the  time  complexity  of  a  parallel  algorithm 
is  a  function  Tp(N)  of  not  only  the  problem  size  N  but  also 
of  the  parallel  processor  size  P.  In  a  dependent-size 
problem  the  problem  size  depends  on  the  parallel  processor 
size,  i.e.  N  is  a  function  f  of  P;  in  a  supersaturated 
problem,  N  is  larger  than  f(P);  in  a  saturated  problem,  N 
equals  P.  The  term  "supersaturated"  is  used  to  indicate 
that  our  primary  analysis  is  for  N  arbitrarily  large 
relative  to  P . 

For example, consider the problem of summing N real
numbers. The obvious (and optimal) sequential algorithm has
complexity Θ(N). When N = P a paracomputer can solve this
problem in time Θ(log P).* To supersaturate this problem we
let N grow. By letting each PE independently determine the
sum of at most ⌈N/P⌉ numbers and then using the saturated
algorithm to determine the sum of these local sums, the
supersaturated problem can be solved in time

    Tp(N) = Θ(N/P + log P) = Θ_S(N/P).

* Summing P integers requires only Θ(1) steps using
replace-add (see section on summing).
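The two-phase scheme just described can be sketched as follows. This is a sequential simulation under stated assumptions: the function name and the step-counting convention (about ⌈N/P⌉ steps for the local phase, one step per round of a pairwise tree reduction for the saturated phase) are illustrative choices, not taken from the paper.

```python
import math

def supersaturated_sum(values, P):
    """Sum values with P simulated PEs; return (total, parallel_steps)."""
    N = len(values)
    chunk = math.ceil(N / P)
    # Phase 1: each PE sums at most ceil(N/P) items; charged as `chunk` steps.
    local = [sum(values[p * chunk:(p + 1) * chunk]) for p in range(P)]
    steps = chunk
    # Phase 2: saturated case, a pairwise tree reduction in ceil(log2 P) rounds.
    stride = 1
    while stride < P:
        for p in range(0, P - stride, 2 * stride):
            local[p] += local[p + stride]
        stride *= 2
        steps += 1
    return local[0], steps

total, steps = supersaturated_sum(list(range(1, 101)), 8)
```

For N = 100 and P = 8 the sketch charges 13 local steps plus 3 reduction rounds, which illustrates why the N/P term dominates once N is large relative to P log P.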

The subscript S denotes the supersaturation limit of
the  indicated  function  of  the  two  variables  N  and  P.  This 
limit  consists  of  taking  asymptotically  large  P  and 
asymptotically  much  larger  N,  and  corresponds  to  first 
choosing  an  arbitrarily  large  parallel  processor  and  then 
loading  it  arbitrarily  heavily.  This  may  be  denoted  by 
1  <<  P  <<<  N,  where  the  triple  inequality  sign  indicates 
that  N  grows  faster  than  any  function  of  P  that  occurs  in 
the  analysis.   For  example, 

    N/P + log P = Θ_S(N/P)


    3NP = Θ_S(NP)


    PN / (N + P log P) = Θ_S(P).

Notations Ω_S and O_S can be introduced in the same way to
denote the analogous lower and upper bounds.


The  speedup  of  a  parallel  algorithm  is   defined   to   be 
the  ratio 

    T1(N) / Tp(N),

where T1(N) is the complexity of the fastest known
sequential algorithm for the same problem. For example, the
speedup of the saturated summing algorithm is

    Θ(P) / Θ(log P) = Θ(P / log P),

and the speedup of the corresponding supersaturated
algorithm is

    Θ(N) / Θ(N/P + log P) = Θ(N / (N/P + log P)) = Θ_S(P).

We say that a problem is asymptotically completely
parallelizable if some algorithm has maximal speedup, i.e.
its speedup is Θ(P). For completely parallelizable problems
it follows easily from the definition of supersaturation
limits that, for each P, there exists N(P) such that, for
each N > N(P), the speedup is Θ(P). We call such problems
completely parallelizable at size N(P).


Many  dependent-size  problems,  e.g.  summing,  are 
supersaturated  by  replacing  each  data  item  by  L  data  items. 
We call L the level of supersaturation (as in Schwartz
[80]). When supersaturated to level L, a dependent-size
algorithm for a problem of size f(P) solves problems of
size Lf(P). In particular, if the original problem was
saturated (and thus f(P) = P), the resulting problem is of
size N = LP. Therefore, for summing,

    Tp(N) = Θ(L + log P) = Θ_S(L).
IV. ALGORITHMS


This  section  contains  supersaturated  paracomputer 
algorithms  for  solving  a  wide  class  of  problems.  We 
continue  to  use  P  for  the  processor  size  and  N  for  the 
problem size. Some of the algorithms assume, for the sake of
clarity,  that  P  divides  N  (i.e.  N  =  LP);  they  are  easily 
generalized  for  P  not  dividing  N. 


Since many algorithms have synchronization points, i.e.
points that all PEs must reach before any PE passes, it is
important to note the following constant-time algorithm for
synchronization: Let x1, x2, x3 be three (otherwise unused)
global variables with x1 and x2 initially 0. For the first
synchronization point, each PE replace-adds 1 to x1 and
waits until x1 = P. At this time the PEs are synchronized.
Some one of the PEs sets x3 equal to 0. For the next
synchronization x2 is used for replace-adding and when x2
reaches P, x1 is set to 0. At that time x3 will necessarily
equal 0, so for the third synchronization x3 may be used,
and then x2 set to 0. The initial state having been
reestablished, we may again use x1 and set x3 to 0, etc. (A
similar synchronization algorithm has been implemented by
Boris Lubachevsky [80].)
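The rotating-counter scheme can be sketched with Python threads. Several details here are assumptions made for illustration: a lock stands in for the hardware replace-add, the PE whose replace-add returns P is the one chosen to clear the previously used counter, and all three counters start at 0. The class and function names are hypothetical.

```python
import threading
import time

class ReplaceAddMemory:
    """Shared cells with an atomic replace-add (a lock stands in for hardware)."""
    def __init__(self, size):
        self.cells = [0] * size
        self._lock = threading.Lock()

    def replace_add(self, loc, inc):
        with self._lock:
            self.cells[loc] += inc
            return self.cells[loc]

def barrier(mem, P, k):
    """k-th synchronization point (k = 0, 1, 2, ...), counters used in rotation."""
    cur = k % 3
    if mem.replace_add(cur, 1) == P:
        # Last arrival: every PE has left the previous barrier,
        # so the counter used there can safely be cleared.
        mem.cells[(cur - 1) % 3] = 0
    while mem.cells[cur] != P:
        time.sleep(0)  # busy-wait, yielding to other threads

def run_phases(P, phases):
    """Run P threads through `phases` barriers; return the order of phase entries."""
    mem = ReplaceAddMemory(3)
    log, log_lock = [], threading.Lock()

    def worker():
        for k in range(phases):
            with log_lock:
                log.append(k)   # record entry into phase k
            barrier(mem, P, k)

    threads = [threading.Thread(target=worker) for _ in range(P)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log

log = run_phases(4, 6)
```

If the barrier works, no thread records phase k+1 before every thread has recorded phase k, so the log is nondecreasing.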
Note that paracomputers can simulate ultracomputers,
mesh-connected computers, cube-connected computers, etc.,
with only a constant-factor time penalty, since any
interconnection pattern can be simulated using standard
protocols for passing messages between communicating
processors. Thus, in particular, all ultracomputer
algorithms can be immediately implemented on a paracomputer
with the same order of time complexity.

Summing

Suppose that we are given an array W = W(1),...,W(N) of
N = PL values, and wish to compute the partial sums
S(i) = W(1) + ... + W(i) for i = 1,...,N. An ultracomputer
can solve this problem in time Θ(L + log P) (see Schwartz
[80] and Gottlieb and Kruskal [80]). By simulating an
ultracomputer, a paracomputer can also solve this problem in
time Θ(L + log P). Thus summing is completely
parallelizable for L = Ω(log P).

Extensions As in Schwartz [80], the summing algorithm may
be generalized by substituting any associative binary
operation for addition. Moreover, the algorithms for
"summing by groups" and for "taking" presented in Schwartz
[80] have analogous supersaturated formulations. Note that
if only S(N) (the total "sum") is desired, more efficient
algorithms exist for certain binary operations. For
example, the maximum of N values can be determined in time
Θ(L + log log P) (see Shiloach and Vishkin [80] and Kruskal
and Teller [81]).


Integer summing When finding the partial sums of N
integers, we can make heavy use of the replace-add to solve
the problem in time O(N/P + log log P) by adapting Kruskal
and Teller's implementation of the Valiant [75] algorithm
for finding the maximum.

First consider the dependent-size problem where
P = N(N-1)/2. This problem is easily solved in constant
time: Assign the first PE the task of finding S(1) (the
first partial sum), the next two PEs the task of finding
S(2), the next three PEs the task of finding S(3), etc. The
partial sums S(i) can be independently determined in
constant time by initially setting S(i) = 0 and then
replace-adding W(1),...,W(i) to S(i).
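A minimal sequential sketch of this constant-time scheme follows. It issues one replace-add per pair (i, j) with j ≤ i and leaves aside how those tasks map onto PEs; the function name and 0-based indexing are assumptions of the sketch.

```python
def dependent_size_partial_sums(W):
    """Partial sums of the integers W, one simulated replace-add per pair (i, j), j <= i."""
    N = len(W)
    S = [0] * N  # each S(i) starts at 0
    # All of these replace-adds belong to a single parallel time step; the loop
    # order merely stands in for the arbitrary serial order that is effected.
    tasks = [(i, j) for i in range(N) for j in range(i + 1)]
    for i, j in tasks:
        S[i] += W[j]  # replace-add(S(i), W(j)); the returned value is not needed
    return S

sums = dependent_size_partial_sums([3, 1, 4, 1, 5])
```

Because addition is commutative, the final contents of each S(i) do not depend on the serial order in which the replace-adds on it are effected.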

Next assume merely P ≥ N; this problem can be solved
with the following algorithm:

If N = 1 set S(1) = W(1) and return. Otherwise perform the
following five steps.

(1) Partition the N items into g = ⌈N²/(2P+N)⌉ groups
G1,...,Gg each of size s ≈ (2P+N)/N, so that the
first s items are in group G1, the next s items in
G2, etc.

(2) Partition the PEs also into g groups with s(s-1)/2 ≈
P(2P+N)/N² PEs in each.

(3) Assign each group of PEs to a distinct group of
items, and solve the summing problem for each group
independently using the preceding dependent-size
integer summing algorithm.

(4) Apply this algorithm recursively to the total sums
T(i) of each group Gi, thereby producing
U(1),...,U(g), the partial sums of the T(i)'s.

(5) Add U(i-1) (or 0 if i = 1) to each partial sum in
Gi.

Steps (1), (2), (3), and (5) each require constant
time and, since P ≤ N(N-1)/2, the depth of the recursion at
step (4) is Θ(log log N - log log(P/N+1)) (see Valiant
[75]), so the entire algorithm requires time Θ(log log N -
log log(P/N+1)). In particular, the saturated problem (i.e.
N = P) is solvable in time Θ(log log P).


Finally, consider the case when P < N, and use the
following algorithm:

(1) Partition the items into P groups G1,...,GP each of
size N/P, so that the first N/P items are in group
G1, the next N/P items in G2, etc.

(2) Apply the sequential summing algorithm to each group
independently.

(3) Apply the preceding saturated integer summing
algorithm to the total sums T(i) of each group Gi,
thereby producing U(1),...,U(P), the partial sums
of the T(i)'s.

(4) Add U(i-1) (or 0 if i = 1) to each partial sum in
Gi.

Step (1) requires constant time, steps (2) and (4)
require Θ(L) time, and step (3) requires Θ(log log P) time.
Thus, the entire algorithm requires Θ(L + log log P) time
and is completely parallelizable for L = Ω(log log P).
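The four steps for the case P < N can be sketched sequentially as below. The sketch assumes P divides N and uses `itertools.accumulate` in place of both the per-group sequential summing and the saturated algorithm of step (3); the function name is illustrative.

```python
from itertools import accumulate

def block_partial_sums(W, P):
    """Partial sums of W (len(W) divisible by P) following steps (1)-(4)."""
    L = len(W) // P
    # (1) P groups of L consecutive items, one group per simulated PE.
    groups = [W[p * L:(p + 1) * L] for p in range(P)]
    # (2) Sequential partial sums inside each group.
    local = [list(accumulate(g)) for g in groups]
    # (3) Partial sums U of the group totals T (the saturated subproblem).
    T = [sums[-1] for sums in local]
    U = list(accumulate(T))
    # (4) Add U(i-1), or 0 for the first group, to every partial sum in group i.
    return [s + (U[p - 1] if p > 0 else 0) for p in range(P) for s in local[p]]

W = list(range(1, 13))
result = block_partial_sums(W, 4)
```

Only step (3) involves communication between groups, which is why its Θ(log log P) cost is the only term not proportional to L.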

Unordered integer summing The unordered summing problem is
the problem of finding the partial sums of some one
unspecified permutation of the data. If the W(i) are
integers this sum can be formed in time Θ(L): initialize
some temporary location T to 0 and replace-add every W(i) to
T. The result of the addition of W(i) is the partial sum
S(i).
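A short simulation makes the "unspecified permutation" concrete. The shuffled order below is an assumption standing in for whatever serial order the machine happens to effect; the function name and seed parameter are illustrative.

```python
import random

def unordered_partial_sums(W, seed=0):
    """Replace-add every W(i) to one cell; return (values returned, order effected)."""
    rng = random.Random(seed)
    order = list(range(len(W)))
    rng.shuffle(order)  # stand-in for the arbitrary serial order
    T = 0
    returned = [None] * len(W)
    for i in order:
        T += W[i]         # replace-add(T, W(i)) ...
        returned[i] = T   # ... returns the new sum to the issuing PE
    return returned, order

W = [5, 2, 7, 1]
returned, order = unordered_partial_sums(W)
```

Reading the returned values in the order the replace-adds were effected gives the partial sums of that permutation of W.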

Permutations


Suppose we are given an array A = A(1),...,A(N) of size
N  =  PL  and  a  permutation  PI  of  1,...,N.  Then  the 
permutation  problem  is  to  permute  A  according  to  PI. 
Algorithm One algorithm for solving this problem allocates
a temporary array T of size N and performs the following two
steps:

(1) Copy A directly into T (i.e., each PEi moves A((i-1)L+j)
into T((i-1)L+j) for 1 ≤ j ≤ L).

(2) Copy T back into A according to PI (i.e., each PEi
moves T((i-1)L+j) into A(PI((i-1)L+j)) for 1 ≤ j ≤ L).

Analysis Steps (1) and (2) both require time Θ(L). Thus
the entire algorithm requires time Θ(L) = Θ_S(L) and is
completely parallelizable at all levels.
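The two copy steps can be sketched as follows, using 0-based indices and assuming P divides N. The function name is hypothetical, and the nested loops simulate the P PEs one after another.

```python
def permute(A, PI, P):
    """Permute A in place so that the item at position q moves to position PI[q]."""
    N = len(A)
    L = N // P
    T = [None] * N
    # Step (1): each simulated PE copies its L consecutive items of A into T.
    for pe in range(P):
        for j in range(L):
            T[pe * L + j] = A[pe * L + j]
    # Step (2): each simulated PE scatters its items of T back into A via PI.
    for pe in range(P):
        for j in range(L):
            q = pe * L + j
            A[PI[q]] = T[q]
    return A

permuted = permute(list("abcdef"), [2, 0, 1, 5, 3, 4], 3)
```

Since PI is a permutation, no two PEs write the same location in step (2), so the result does not depend on the order in which the PEs run.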


Variant Unfortunately, the above algorithm requires extra
storage proportional to N. When PI is known in advance,
i.e. PI is not part of the data, the problem reduces to the
static permutation problem (see Gottlieb and Kruskal [80])
which is solvable in time Θ(L) using extra storage
proportional only to P: Partition A into R and S where
|R| = P and |S| = N-P. Copy R into a temporary array R'
(thus A = R' (disjoint) UNION S). Store into R (from R' and
S) the items in PI^-1(R). (Note that PI and hence PI^-1 are
known in advance.) Store the items of R' that have not
been placed back into R into the free locations of S. Now
the problem has been reduced, in constant time, from A of
size N to S of size N-P. L such iterations will effect the
entire permutation.
Packing

Suppose we are given an array A = A(1),...,A(N) of N =
PL items, some of which are marked. The packing problem is
to move the ith marked item to the ith location of A.

Algorithm

(1) Use (supersaturated) integer summing (with marked
items assigned 1 and unmarked items assigned 0) to
determine the desired destinations of the marked
items.

(2) Partition the array A into L blocks of P contiguous
items. Perform the following two steps for
k = 1,...,L:

(a) Each PEi stores the ith item of the kth
block into a (distinct) temporary location
T(i).

(b) Each PEi whose associated item T(i) is
marked moves the item from T(i) into its
desired destination in A.


Analysis Step (1) is supersaturated integer summing and
thus requires time Θ(L + log log P), and step (2) consists
of L iterations of two Θ(1) operations and thus requires
time Θ(L). Therefore, the entire algorithm requires time
Θ(L + log log P) = Θ_S(L) and is completely parallelizable
for L = Ω(log log P).
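The packing algorithm can be sketched sequentially as below. The sketch assumes 0-based indices and that P divides N, and it uses `itertools.accumulate` as a stand-in for supersaturated integer summing; the function name is illustrative.

```python
from itertools import accumulate

def pack(A, marked, P):
    """Move the i-th marked item of A to position i; return the packed prefix."""
    N = len(A)
    L = N // P
    # Step (1): partial sums of the 0/1 marks give each marked item's destination.
    dest = list(accumulate(1 if m else 0 for m in marked))
    # Step (2): process A in L blocks of P contiguous items.
    for k in range(L):
        block = range(k * P, (k + 1) * P)
        T = [A[q] for q in block]          # (a) stash block k in temporaries
        for i, q in enumerate(block):      # (b) marked items move to their slots
            if marked[q]:
                A[dest[q] - 1] = T[i]
    return A[:dest[-1]] if N else []

packed = pack(list("abcdefgh"), [0, 1, 0, 1, 1, 0, 0, 1], 4)
```

Stashing each block in T before writing matters because a destination always lies at or before the item's own position, so a write may land inside the block currently being processed.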
Variants The unordered packing problem is the same as the
packing problem, except that it is unnecessary for the
marked items to maintain their original relative order.
This problem can be solved in time Θ(L) by replacing
supersaturated summing with supersaturated unordered summing
in step (1) above. Thus the unordered packing problem is
completely parallelizable at all levels.

Unfortunately,  this  algorithm  requires  extra  storage 
proportional  to  N.  Unordered  packing,  however,  can  be 
realized in time Θ(L) using extra storage proportional only
to  P:  delete  step  (1)  and  begin  step  (2)  with  a  replace-add 
to  determine,  at  the  kth  iteration  of  step  (2),  the  desired 
destination  of  the  items  in  the  kth  block. 


Actually,  the  packing  problem  posed  in  Schwartz  [80] 
not  only  requires  that  the  marked  items  end  up  in  the  top  of 
A,  but  also  that  the  unmarked  items  end  up  in  the  bottom  of 
A.  This  variant  is  easily  solved  by  making  a  copy  of  A, 
packing  the  marked  items  into  the  top  of  A,  and  then  using 
the  copy  to  pack  the  unmarked  items  into  the  bottom  of  A. 
Merging

Valiant [75] defines the following abstract model of
computation:

The  processors  are  synchronized  so  that  within  each 
time  interval  each  of  them  completes  a  comparison.  At 
the  end  of  the  interval  the  algorithm  decides,  by 
inspecting  the  ordering  relationships  that  have  already 
been  established,  which  P  (not  necessarily  disjoint) 
pairs  of  elements  are  to  be  compared  during  the  next 
interval,  and  assigns  processors  to  them.  The 
computation  terminates  when  the  relationships  that  have 
been  discovered  are  sufficient  to  specify  the  solution 
to  the  given  problem. 

Such a computation becomes an algorithm when a method is
given for allocating the PEs to the proper comparisons.
Using a shared memory machine (essentially a paracomputer
without replace-add; see Section V), Shiloach and Vishkin
[80] have prescribed the PE allocation needed to implement
Valiant's computation for finding the maximum, but they
say that "there is no apparent way to overcome the
allocation problem" for Valiant's merging computation.
Using the paracomputer model, Kruskal [81c] solves the
allocation problem and thereby obtains an algorithm for
merging two sorted lists of sizes m, n, where m ≥ n and
N = m+n, in time

    Θ(1)                           if P = Ω(mn)
    Θ(log log n - log log(P/N))    if ω(√(mn)) = P = o(mn) *
    Θ(N/P + log log n)             if P = O(√(mn)).

* We say that f = ω(g) if g = o(f).

In particular, if m = n the complexity is

    Θ(1)                           if P = Ω(N²)
    Θ(log log N - log log(P/N))    if ω(N) = P = o(N²)
    Θ(N/P + log log N)             if P = O(N).

Thus, merging is completely parallelizable for
L = Ω(log log n).


Sorting


Valiant [75] applies the merging computation to obtain
a computation for sorting. In Kruskal [81c], we specify the
PE allocation needed to transform the computation into a
paracomputer algorithm. This algorithm sorts N items in
time

    Θ(1)                                                    if P = Ω(N²)
    Θ((log N - log(P/N))(log log N - log log(P/N)))         if ω(N) = P = o(N²)
    Θ(N log N / P + log P log log P)                        if P = O(N).

Thus, sorting is completely parallelizable for
L = Ω(log log P).

An important special case Suppose we wish to sort an array
A consisting of N (not necessarily distinct) integers in the
range 1 to N. The following algorithm solves this simpler
problem in time Θ(L + log log P).

(1) Create an array C of size N initialized to 0.

(2) Count how many items have each value by incrementing
(via replace-add) C(A(i)) for all i in {1,...,N}.

(3) Apply integer summing to C and then set
D(i) = C(i-1) (and D(1) = 0), so that D(i) is the
number of items less than i.

(4) Copy A into a temporary array T.

(5) The final location j of the ith original item is
obtained as replace-add(D(T(i)), 1). Set A(j) equal
to T(i).


To  illustrate  this  algorithm  consider  the  problem  of 
sorting  the  array  A  =  (2,1,5,3,2).  After  step  (2)  above 
C  =  (1,2,1,0,1),  where  C(i)  is  the  number  of  items  with 
value  i  (e.g.  two  items  have  value  2  and  no  items  have 
value  4).  At  step  (3)  summing  is  applied  to  transform  C 
into  (1,3,4,4,5);  C(i)  now  represents  the  number  of  items 
less  than  or  equal  to  i.  D  =  (0,1,3,4,4)  is  derived  from  C 
by  shifting  the  values  of  C  right  one  position  and 
represents  the  number  of  items  less  than  i.  At  step  (4)  A 
is  copied  into  T.  Finally  at  step  (5)  the  ith  item  of  T 
determines its final location by replace-adding 1 to
D(T(i)). For example, the fourth item of T is 3 so its
final destination in A is D(3)+1 = 4. More interestingly,
since the first and fifth items of T are both 2, they both
replace-add 1 to D(2) to determine their final destinations;
one of them effects the replace-add first and its final
destination in A is D(2)+1 = 2, and the other effects the
replace-add second and its final destination in A is
D(2)+1+1 = 3.
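The five steps can be traced with a sequential sketch. The shuffled order in step (5) is an assumption standing in for the arbitrary serial order of simultaneous replace-adds; `itertools.accumulate` stands in for integer summing, and the function name is illustrative.

```python
from itertools import accumulate
import random

def replace_add_sort(A, seed=0):
    """Sort integers in the range 1..N (N = len(A)) following steps (1)-(5)."""
    N = len(A)
    C = [0] * (N + 1)                   # (1) counters; index 0 is unused
    for value in A:                     # (2) replace-add(C(A(i)), 1)
        C[value] += 1
    C = list(accumulate(C))             # (3) C(i) = number of items <= i
    D = [0] + C[:-1]                    #     D(i) = C(i-1) = number of items < i
    T = list(A)                         # (4) copy A into T
    order = list(range(N))
    random.Random(seed).shuffle(order)  # arbitrary serial order of the replace-adds
    for i in order:                     # (5) j = replace-add(D(T(i)), 1); A(j) = T(i)
        D[T[i]] += 1
        A[D[T[i]] - 1] = T[i]           # -1 converts the 1-based j to a list index
    return A

sorted_A = replace_add_sort([2, 1, 5, 3, 2])
```

Whichever of the two items with value 2 effects its replace-add first, the resulting array is the same, since equal keys are interchangeable in the output.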


Steps (1), (2), (4), and (5) all require time Θ(L) and
step (3) requires time Θ(L + log log P). Therefore the
entire algorithm requires time Θ(L + log log P) = Θ_S(L) and
its speedup is Θ(LP/(L + log log P)) = Θ_S(P). It is
completely parallelizable for L = Ω(log log P).


An alternate algorithm with good average case behavior We
now describe a parallel version of quicksort and show that
its average-case time complexity is Θ(N log N / P). Thus,
using average-case analyses, comparison-exchange sorting is
completely parallelizable at all levels.

Suppose  we  wish  to  sort  an  array  A  of  N   items.    First 
consider  the  saturated  problem  (i.e.  N  =  P). 


If N ≤ 1 then A is sorted. Otherwise perform the
following steps.

(1)  Choose  an  item  M  at  random  from  A. 

(2)  Let  S,  E,  B  be  the  sets  of  items  smaller  than  M, 
equal  to  M,  and  bigger  than  M,  respectively.  Apply 
unordered  packing  three  times:  first  to  pack  the 
items  of  S  to  the  beginning  of  A,  then  to  pack  the 
items  of  E  immediately  after,  and  finally  to  pack 
the  items  of  B  to  the  end  of  A. 
(3) Assign |S| PEs to S and |B| PEs to B, and
recursively apply the algorithm to S and B
concurrently.

We now analyze this algorithm under the assumption that
the items are all distinct, which cannot decrease the
(average) execution time. Suppose that the item M chosen
during step (1) is the ith smallest item in A. Then the
algorithm is recursively applied to sets of size i-1 and
N-i. Since i is uniformly distributed over {1,...,N}, we
are essentially constructing a random binary search tree of
size N. The expected depth of the recursion is the expected
height of this tree, which is known to be Θ(log N) (see
Robson [79]). Since only a constant length of time is
required for steps (1) and (2) (see section on packing), the
entire algorithm requires time Θ(log N); and since N = P,
the speedup is Θ(P log P)/Θ(log P) = Θ(P).


For N > P, we employ the above algorithm but assign
each PE the work performed by at most ⌈N/P⌉ PEs in the
saturated case (as shown in Kruskal [81a]). This gives a
time complexity of Θ(N/P) Θ(log N) = Θ(N log N / P), and a
speedup of Θ(N log N)/Θ(N log N / P) = Θ(P).
As a practical consideration, choosing M likely to be
near the true median lowers the average-case time complexity
of the algorithm (but not its order). One possibility is to
use the median of a random sample of some R < N items. If
we choose R = O(√P) the median can be found in only constant
time by sorting (see Kruskal [81c]).

Generalized Median Finding

Suppose we are given an array A of N items from an
ordered set and an integer 1 ≤ k ≤ N, and wish to find the
kth smallest item in the array. When N = P we know of no
algorithm faster than sorting, which has complexity
Θ(log P log log P) and speedup Θ(P / (log P log log P)).
However,  for  the  supersaturated  case  we  can  parallelize  the 
linear  sequential  algorithm  of  Blum  et  al.  [72]  as  follows. 


Algorithm If N ≤ P sort the items; the kth smallest item
is A(k). If N > P perform the following four steps:

(1)  Partition  the  items  into  P  groups  of  size 
(essentially)  N/P.  Assign  PEi  to  the  ith  group  and 
use  the  sequential  fast  median  algorithm  to  find  the 
median  item  in  each  group. 

(2)  Sort  these  medians  to  find  M,  the  median  of  the 
local  medians. 

(3) Let S, E, and B be the sets of items smaller than M,
equal to M, and bigger than M, respectively. Use
unordered summing to determine |S| and |E| (the
cardinalities of the sets S and E).

(4) Perform one of the following three steps:

(a) k ≤ |S|: Pack S using unordered packing and
then recursively apply this (generalized-
median finding) algorithm to S, still
searching for the kth smallest item.

(b) |S| < k ≤ |S| + |E|: The kth smallest item
is M.

(c) |S| + |E| < k: Pack B using unordered
packing and then recursively apply this
(generalized-median finding) algorithm to B,
but now searching for the (k - |S| - |E|)th
smallest item.


Analysis The important property of this algorithm is that
at each recursive application at least a quarter of the
remaining items are eliminated from consideration. After
log_{4/3} L recursions, the number of items remaining is no more
than

    N (3/4)^(log_{4/3} L) = N/L = P

at which point we apply the sorting algorithm. The
complexity of step (1) is Θ(N/P), of step (2) is
Θ(log P log log P), of step (3) is Θ(N/P), and of step (4)
is Θ(N/P + Tp((3/4)N)), where Tp(N) is the complexity of the
entire algorithm. Thus, the complexity satisfies

    Tp(N) = Θ(log P log log P)                        if N ≤ P
    Tp(N) = Θ(log P log log P + N/P + Tp((3/4)N))     if N > P

          = Θ(L + (log L + 1) log P log log P)
          = Θ_S(L)

giving a speedup of

    Θ(PL / (L + (log L + 1) log P log log P)) = Θ_S(P).

Hence the algorithm is completely parallelizable for
L = Ω(log P (log log P)²).
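The recursive structure of the algorithm can be sketched sequentially as below. Several simplifications are assumptions of the sketch: each group's median is found with `sorted` rather than the linear-time algorithm of Blum et al., list comprehensions replace unordered summing and packing, and the function name is hypothetical.

```python
def kth_smallest(A, k, P):
    """Return the k-th smallest item of A (k is 1-based) using P simulated PEs."""
    N = len(A)
    if N <= P:
        return sorted(A)[k - 1]
    # (1) P groups of about N/P items; each simulated PE finds its group's median.
    size = -(-N // P)  # ceil(N / P)
    groups = [A[g:g + size] for g in range(0, N, size)]
    medians = [sorted(g)[len(g) // 2] for g in groups]
    # (2) M is the median of the local medians.
    M = sorted(medians)[len(medians) // 2]
    # (3) Split around M; the list lengths play the role of |S| and |E|.
    S = [x for x in A if x < M]
    E = [x for x in A if x == M]
    B = [x for x in A if x > M]
    # (4) Recurse into the part that holds the answer.
    if k <= len(S):
        return kth_smallest(S, k, P)
    if k <= len(S) + len(E):
        return M
    return kth_smallest(B, k - len(S) - len(E), P)

data = [(37 * i) % 101 for i in range(100)]
picks = [kth_smallest(data, k, 8) for k in (1, 25, 50, 100)]
```

Because M always belongs to A, the set E is nonempty and both S and B are strictly smaller than A, so the recursion terminates.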

An alternate algorithm with good average case behavior In
the remainder of this subsection we describe an algorithm
for finding the kth smallest item that has average-case time
complexity Θ(L + log P). For L = o(log P (log log P)²),
this alternate algorithm is asymptotically more efficient
than the above algorithm. Moreover, it has significantly
less overhead and is therefore faster on average for all L.


This  algorithm  is  similar  to  the  above  one,  the  main 
differences  being  that  sorting  is  eliminated  and  that  steps 
(1)  and  (2)  above  are  replaced  by  choosing  M  randomly. 

(1)  Choose  an  item  M  at  random  from  the  set. 

(2) Let S, E, and B be the sets of items smaller than M,
equal to M, and bigger than M, respectively. Use
unordered summing to determine |S| and |E|, the
cardinalities of the sets S and E.

(3) Perform one of the following three steps:

(a) k ≤ |S|: Pack S using unordered packing and
then recursively apply this (generalized-
median finding) algorithm to S, still
searching for the kth smallest item.

(b) |S| < k ≤ |S| + |E|: The kth smallest item
is M.

(c) |S| + |E| < k: Pack B using unordered
packing and then recursively apply this
(generalized-median finding) algorithm to B,
but now searching for the (k - |S| - |E|)th
smallest item.


We now analyze this algorithm under the assumption that
the (values of the) items are all distinct. This assumption
cannot decrease the (average) execution time. Let $T_P(N)$ be
the expected time complexity of this algorithm and suppose
that M is the ith smallest item of the set. If i > k, the
algorithm is applied to a set of i-1 items, and if i < k,
the algorithm is applied to a set of N-i items. Thus, the
expected time for the recursion in step (3) is

$$
\frac{1}{N}\Bigl(\sum_{i=1}^{k-1} T_P(N-i) + \sum_{i=k+1}^{N} T_P(i-1)\Bigr)
= \frac{1}{N}\Bigl(\sum_{i=N-k+1}^{N-1} T_P(i) + \sum_{i=k}^{N-1} T_P(i)\Bigr).
$$

First assume $N \le P$. Then steps (1) and (2) and the
packing in step (3) require constant time c, so we have the
following inequality:

$$
T_P(N) \le c + \max_{1 \le k \le N}
\Bigl\{ \frac{1}{N}\Bigl(\sum_{i=N-k+1}^{N-1} T_P(i) + \sum_{i=k}^{N-1} T_P(i)\Bigr) \Bigr\}.
$$


Lemma: $T_P(N) \le a \ln N + b$, where $a = c/(1 - \ln 2)$ and
$b = T_P(1)$.

We prove this lemma by induction on N. The formula
obviously holds for N = 1. Next suppose that N > 1 and that
the formula holds for i < N. Then

$$
\begin{aligned}
T_P(N) &\le c + \max_{k} \Bigl\{ \frac{1}{N}\Bigl(\sum_{i=N-k+1}^{N-1} (a \ln i + b)
          + \sum_{i=k}^{N-1} (a \ln i + b)\Bigr) \Bigr\} \\
       &= c + \frac{a}{N} \max_{k} \Bigl\{ \sum_{i=N-k+1}^{N-1} \ln i
          + \sum_{i=k}^{N-1} \ln i \Bigr\} + \frac{b(N-1)}{N}.
\end{aligned}
$$


Since ln x is monotonically increasing,

$$
\begin{aligned}
\sum_{i=N-k+1}^{N-1} \ln i + \sum_{i=k}^{N-1} \ln i
  &\le \int_{N-k+1}^{N} \ln x \, dx + \int_{k}^{N} \ln x \, dx \\
  &= N \ln N - N - (N-k+1)\ln(N-k+1) + (N-k+1) \\
  &\quad + N \ln N - N - k \ln k + k.
\end{aligned}
$$

Allowing k to assume arbitrary real values, the maximum is
obtained at k = (N+1)/2. Thus

$$
\begin{aligned}
T_P(N) &\le c + \frac{a}{N}\Bigl(2N \ln N - 2N - (N+1)\ln\frac{N+1}{2} + (N+1)\Bigr)
          + \frac{b(N-1)}{N} \\
       &\le c + a\Bigl(2 \ln N - \ln(N+1) - (1 - \ln 2)
          - \frac{1}{N}\Bigl(\ln\frac{N+1}{2} - 1\Bigr)\Bigr)
          + b\Bigl(1 - \frac{1}{N}\Bigr).
\end{aligned}
$$

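Rewriting $\ln(N+1)$ as $\ln N + \ln(1 + \frac{1}{N})$ and noting that

$$
\ln\Bigl(1 + \frac{1}{N}\Bigr)
= \Bigl(\frac{1}{N} - \frac{1}{2N^2}\Bigr) + \Bigl(\frac{1}{3N^3} - \frac{1}{4N^4}\Bigr) + \cdots
> \frac{1}{N} - \frac{1}{2N^2}
$$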

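gives

$$
\begin{aligned}
T_P(N) &\le c + a\Bigl(2 \ln N - \ln N - \frac{1}{N} + \frac{1}{2N^2} - (1 - \ln 2)
          - \frac{1}{N}\Bigl(\ln\frac{N+1}{2} - 1\Bigr)\Bigr) + b\Bigl(1 - \frac{1}{N}\Bigr) \\
       &= c + a\Bigl(\ln N - (1 - \ln 2)
          - \frac{1}{N}\Bigl(\ln\frac{N+1}{2} - \frac{1}{2N}\Bigr)\Bigr) + b\Bigl(1 - \frac{1}{N}\Bigr) \\
       &\le c + a \ln N + b - a\Bigl(1 - \ln 2
          + \frac{1}{N}\Bigl(\ln\frac{N+1}{2} - \frac{1}{2N}\Bigr)\Bigr).
\end{aligned}
$$

Since (for $N \ge 2$) $\ln((N+1)/2) - 1/(2N) > 0$ and
$a(1 - \ln 2) = c$ (by definition), the term c cancels and we
get $T_P(N) \le a \ln N + b$, which completes the induction.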
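
The lemma can also be checked numerically. The sketch below is an illustration only: it treats the inequality for $N \le P$ as an equality (the worst case the lemma allows) and compares the resulting values with $a \ln N + b$. The function names and the default values c = 1 and T(1) = 1 are assumptions chosen for the example.

```python
import math

def worst_case_recurrence(n_max, c=1.0, t1=1.0):
    """Return t with t[n] = T(n), taking the lemma's inequality as an equality.

    T(1) = t1 and, for n > 1,
    T(n) = c + max_k (1/n) * (sum_{i=n-k+1}^{n-1} T(i) + sum_{i=k}^{n-1} T(i)).
    """
    t = [0.0, t1]        # t[0] is unused padding
    prefix = [0.0, t1]   # prefix[j] = T(1) + ... + T(j)
    for n in range(2, n_max + 1):
        total = prefix[n - 1]
        # First sum is total - prefix[n-k]; second sum is total - prefix[k-1].
        worst = max(
            (total - prefix[n - k]) + (total - prefix[k - 1])
            for k in range(1, n + 1)
        )
        t.append(c + worst / n)
        prefix.append(total + t[n])
    return t

def lemma_holds(n_max, c=1.0, t1=1.0):
    """Check T(n) <= a ln n + b for 1 <= n <= n_max, a = c/(1 - ln 2), b = T(1)."""
    a, b = c / (1.0 - math.log(2.0)), t1
    t = worst_case_recurrence(n_max, c, t1)
    return all(t[n] <= a * math.log(n) + b + 1e-9 for n in range(1, n_max + 1))
```

A call such as `lemma_holds(300)` evaluates the recurrence in quadratic time and reports whether the bound held for every N up to 300; it is a sanity check on the constants, not a substitute for the induction.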


Now assume $N \ge P$. In this case steps (1) and (2) and
the packing part of step (3) require at most C(N/P) time for
some constant C, so we have the following inequality:

$$
T_P(N) \le C\frac{N}{P} + \max_{1 \le k \le N}
\Bigl\{ \frac{1}{N}\Bigl(\sum_{i=N-k+1}^{N-1} T_P(i) + \sum_{i=k}^{N-1} T_P(i)\Bigr) \Bigr\}.
$$


We shall show by induction on N that
$T_P(N) \le d(N/P + \ln P + 1)$, for any $d \ge \max\{a, b, 4C\}$.
In case $N \le P$, by the lemma we have immediately that
$T_P(N) \le a \ln N + b \le d(\ln N + 1) \le d(N/P + \ln P + 1)$.
Now suppose that N > P and that the formula holds for i < N.
Then

$$
\begin{aligned}
T_P(N) &\le C\frac{N}{P} + \max_{k} \Bigl\{ \frac{1}{N}\Bigl(
          \sum_{i=N-k+1}^{N-1} d\Bigl(\frac{i}{P} + \ln P + 1\Bigr)
          + \sum_{i=k}^{N-1} d\Bigl(\frac{i}{P} + \ln P + 1\Bigr)\Bigr) \Bigr\} \\
       &\le C\frac{N}{P} + \max_{k} \Bigl\{ \frac{d}{N}\Bigl( \frac{1}{P}\Bigl(
          \frac{(N-1)N}{2} - \frac{(N-k)(N-k+1)}{2} \\
       &\qquad + \frac{(N-1)N}{2} - \frac{(k-1)k}{2}\Bigr)
          + (N-1)(\ln P + 1)\Bigr) \Bigr\}.
\end{aligned}
$$

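Again the maximum is attained at k = (N+1)/2, so

$$
\begin{aligned}
T_P(N) &\le C\frac{N}{P} + \frac{d}{N}\Bigl( \frac{1}{P}\Bigl(N(N-1)
          - \frac{N-1}{2}\cdot\frac{N+1}{2}\Bigr) + (N-1)(\ln P + 1)\Bigr) \\
       &= C\frac{N}{P} + d\Bigl(\frac{N}{P} - \frac{1}{P} - \frac{N}{4P} + \frac{1}{4NP}
          + \ln P - \frac{\ln P}{N} + 1 - \frac{1}{N}\Bigr) \\
       &\le C\frac{N}{P} + d\Bigl(\frac{3}{4}\cdot\frac{N}{P} + \ln P + 1\Bigr).
\end{aligned}
$$

Since $d \ge 4C$ (by definition), $T_P(N) \le d(N/P + \ln P + 1)$,
which completes the induction.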
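
The bound for general N can be spot-checked in the same numerical spirit. The sketch below is illustrative only: it charges a per-round cost of c when $N \le P$ and of C(N/P) when N > P, again takes the recurrence as an equality, and tests the result against $d(N/P + \ln P + 1)$ with $d = \max\{a, b, 4C\}$. The names `recurrence_with_p`, `bound_holds`, and `big_c` (standing for the constant C) and the default constants are assumptions made for the example.

```python
import math

def recurrence_with_p(n_max, p, c=1.0, big_c=1.0, t1=1.0):
    """Return t with t[n] = T_P(n) for a machine with p PEs (worst case over k)."""
    t = [0.0, t1]        # t[0] is unused padding
    prefix = [0.0, t1]   # prefix[j] = T(1) + ... + T(j)
    for n in range(2, n_max + 1):
        # Cost of steps (1), (2) and the packing in step (3).
        round_cost = c if n <= p else big_c * n / p
        total = prefix[n - 1]
        worst = max(
            (total - prefix[n - k]) + (total - prefix[k - 1])
            for k in range(1, n + 1)
        )
        t.append(round_cost + worst / n)
        prefix.append(total + t[n])
    return t

def bound_holds(n_max, p, c=1.0, big_c=1.0, t1=1.0):
    """Check T_P(n) <= d (n/p + ln p + 1) with d = max(a, b, 4C)."""
    a, b = c / (1.0 - math.log(2.0)), t1
    d = max(a, b, 4.0 * big_c)
    t = recurrence_with_p(n_max, p, c, big_c, t1)
    return all(
        t[n] <= d * (n / p + math.log(p) + 1.0) + 1e-9
        for n in range(1, n_max + 1)
    )
```

With these defaults d = 4, so `bound_holds(400, 16)` compares each value of the recurrence with 4(N/16 + ln 16 + 1).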


Since d is a constant, $T_P(N) = O(L + \log P)$. A similar
analysis yields an equivalent lower bound for this algorithm
and therefore $T_P(N) = \Theta(L + \log P)$. Thus, on average,
the speedup is $\Theta(PL/(L + \log P))$ and the problem is
completely parallelizable for $L = \Omega(\log P)$.


As in the parallel quicksort algorithm (see section on
sorting), it is advantageous in step (1) to spend some time
to pick a desirable M. One possibility is to sort a random
sample of some R items, and then pick M to be the (Rk/N)th
smallest item in the sample. Once again, $R = O(\sqrt{P})$
items can be sorted in constant time.
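
A sequential sketch of this sampling rule is given below. It is an illustration under stated assumptions: the name `sample_pivot` is hypothetical, and rounding Rk/N to the nearest integer and clamping it into the range 1..R are choices made here, since the text only specifies "the (Rk/N)th smallest item". The call to `sorted` stands in for the constant-time parallel sort of the sample.

```python
import random

def sample_pivot(items, k, r, rng=random):
    """Pick M as roughly the (r*k/n)th smallest item of a random sample of r items."""
    n = len(items)
    if not 1 <= r <= n:
        raise ValueError("sample size r must satisfy 1 <= r <= len(items)")
    # Draw r distinct positions at random and sort the sampled items.
    sample = sorted(rng.sample(items, r))
    # Assumed rounding rule: nearest integer rank, clamped into 1..r.
    rank = min(r, max(1, round(r * k / n)))
    return sample[rank - 1]
```

When r equals the number of items the sample is the whole set, so the function returns the kth smallest item exactly; smaller samples trade accuracy of the pivot for less sorting work.
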

Fast Fourier Transform


Since an ultracomputer can realize the FFT in time
$\Theta(N \log N / P)$ (see Gottlieb and Kruskal [80]), so can a
paracomputer. Therefore, the FFT is completely
parallelizable at all levels.

Set and Map Operations


Gottlieb and Kruskal [80] discuss algorithms for
performing set and map operations (e.g. union, intersection,
function composition, and range determination) on an
ultracomputer. Their algorithms are implemented using
sorting and packing as basic operations, and are immediately
transferable to paracomputers using the above sorting and
packing algorithms. Sorting is the dominant step in all
these algorithms, so the complexities and speedups attained
for set and map operations are the same as given above for
sorting. Note that performing set operations when the sets
of items are already sorted requires only the complexity of
merging.
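
To illustrate how a set operation reduces to sorting followed by packing, the sketch below computes a union and an intersection that way. It is one possible reduction written for this example and is not claimed to be the construction used by Gottlieb and Kruskal [80]; the function names are hypothetical, each input is assumed to contain no duplicates, and `sorted` and the list comprehension stand in for the parallel sorting and packing algorithms.

```python
def pack(flags, items):
    """Keep the flagged items in order (stands in for the packing step)."""
    return [x for x, keep in zip(items, flags) if keep]

def union(a, b):
    """Union of two duplicate-free collections via sort, flag, pack."""
    merged = sorted(list(a) + list(b))
    # Flag the first copy of every value.
    flags = [i == 0 or merged[i] != merged[i - 1] for i in range(len(merged))]
    return pack(flags, merged)

def intersection(a, b):
    """Intersection of two duplicate-free collections via sort, flag, pack."""
    merged = sorted(list(a) + list(b))
    # A value present in both inputs appears twice in a row after sorting.
    flags = [i > 0 and merged[i] == merged[i - 1] for i in range(len(merged))]
    return pack(flags, merged)
```

In both functions the flag for position i depends only on positions i and i-1, so the flags could be computed by independent PEs once the sort has finished; the sort remains the dominant cost, as the text observes.
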

V.  SUMMARY  AND  CONCLUSIONS


In this section we summarize our results and compare
them with those obtained for two other important parallel
architectures: the shared memory machine (essentially a
paracomputer without the replace-add) and the ultracomputer.

Table I presents the performance figures for
sequential, saturated, and supersaturated versions of the
paracomputer algorithms discussed above. Although all of
these problems are completely parallelizable in the
supersaturation limit (i.e. they attain maximal speedup for
large enough problems), problems do exist for which all
algorithms attain only minimal speedup (see Kruskal [81b]).

Table II compares the speedups (at the supersaturation
limit) of ultracomputers, shared memory machines, and
paracomputers. The ultracomputer (Schwartz [80]) is an
efficient message passing architecture based on the perfect
shuffle interconnection (Stone [71]). The shared memory
machine has a common memory, allows multiple reads and
writes to the same location, and has the ability to
synchronize the PEs, i.e. it is a paracomputer without the
replace-add but with a simple synchronization primitive.
Table II also indicates -- in parentheses -- the level at
which the given speedups are reached. The ultracomputer
results are primarily from Gottlieb and Kruskal [80].

It can be seen from Table II that, for all of the
problems discussed above, both the paracomputer and shared
memory machine yield maximal speedup in the supersaturation
limit, but that the paracomputer sometimes reaches the limit
somewhat earlier. The paracomputer programs are often
simpler and contain less overhead.

The ultracomputer, however, does not always permit
maximal speedup, e.g. the permutation problem has a speedup
of only $\Theta(P / \log P)$ (see Gottlieb and Kruskal [80]). Even
when a problem is completely parallelizable on an
ultracomputer, it sometimes reaches its supersaturation
limit significantly later than does a paracomputer, e.g.
consider sorting.


The paracomputer performances for solving the above
problems are, in our opinion, quite impressive. Together
with their ability (as noted earlier) to realize highly
concurrent operating system primitives, this makes
paracomputers an architecture worth striving for. While
fan-in and other limitations prevent their physical
realization, they can be reasonably approximated by machines
using a multistage interconnection network (see Gottlieb et
al. [80,81] and Gottlieb and Schwartz [81]). The
"ultracomputer group" at the Courant Institute of New York
University is presently designing a prototype of such a
machine and we feel that a full scale version containing tens
of thousands of PEs will be constructible by the early 1990's.

Problem                     Ultracomputer           Shared Memory       Paracomputer

Summing                     P  (log P)              P  (log P)          P  (log P)
Integer Summing             P  (log P)              P  (log P)          P  (log log P)
Unordered Integer Summing   P  (log P)              P  (log P)          P  (1)
Maximum                     P  (log P)              P  (log log P)      P  (log log P)
Permuting                   P / log P  (1) *        P  (1)              P  (1)
Packing                     P / log P  (1)          P  (log P)          P  (log log P)
Unordered Packing           P / log P  (1)          P  (log P)          P  (1)
Merging                     P / log P  (1)          P  (log P)          P  (log log P)
Sorting                     P  (P^(log P))          P  (log P)          P  (log log P)
Sorting Average-Case        P  (P^(log P))          P  (log P)          P  (1)


Speedup in supersaturation limit
(Level at which supersaturation limit is reached)

TABLE II

Problem                   Ultracomputer                     Shared Memory                   Paracomputer

Sorting (1,...,N)         P / log P  (1)                    P  (log P)                      P  (log log P)
Median                    P / log P  ((log log P) log P)    P  ((log log P) (log P)^2)      P  ((log log P)^2 log P)
Median Average-Case       P / log P  (log P)                P  ((log log P) log P)          P  (log P)
Set and Map Operations    P  (P^(log P))                    P  (log P)                      P  (log log P)
FFT                       P  (1)                            P  (1)                          P  (1)

*  For the dynamic permutation problem ultracomputers reach
the supersaturation limit only at L  =  A(P  "     /  log  P).


Speedup in supersaturation limit
(Level at which supersaturation limit is reached)

TABLE II (continued)